McGill University

R Tutorial

Guilherme D. Garcia

DAY 3: Data visualization

Today’s topic is related to Exploratory Data Analysis (EDA), which is usually the step that comes before the statistical analysis.

We’ll be using ggplot2, a widely used R package for plotting data. Other packages are also quite useful: graphics, rCharts, lattice etc. These packages do similar things, but some offer more tools. ggplot2 has a huge library of plots. Crucially, it has a very nice documentation (also, because it’s used a lot, you will find # many forums/websites about it).

This time we’ll use the english data set, from # the languageR package. There are continuous and categorical # variables in the data, and that’s useful for plotting.

NB: Last time we also had some EDA, when we used xtabs etc.

Load the packages and the english dataset, which is included in the languageR package:

library(ggplot2)
library(languageR)
library(plyr)
library(scales) # for % in plots

data(english)

Let’s use a very short name for our data this time

d = english

This dataset has a lot of columns, so let’s first create a more concise version (we’ll overwrite d).

names(d)
##  [1] "RTlexdec"                        "RTnaming"                       
##  [3] "Familiarity"                     "Word"                           
##  [5] "AgeSubject"                      "WordCategory"                   
##  [7] "WrittenFrequency"                "WrittenSpokenFrequencyRatio"    
##  [9] "FamilySize"                      "DerivationalEntropy"            
## [11] "InflectionalEntropy"             "NumberSimplexSynsets"           
## [13] "NumberComplexSynsets"            "LengthInLetters"                
## [15] "Ncount"                          "MeanBigramFrequency"            
## [17] "FrequencyInitialDiphone"         "ConspelV"                       
## [19] "ConspelN"                        "ConphonV"                       
## [21] "ConphonN"                        "ConfriendsV"                    
## [23] "ConfriendsN"                     "ConffV"                         
## [25] "ConffN"                          "ConfbV"                         
## [27] "ConfbN"                          "NounFrequency"                  
## [29] "VerbFrequency"                   "CV"                             
## [31] "Obstruent"                       "Frication"                      
## [33] "Voice"                           "FrequencyInitialDiphoneWord"    
## [35] "FrequencyInitialDiphoneSyllable" "CorrectLexdec"
columns = c("RTlexdec", "Familiarity", "Word", "AgeSubject", "LengthInLetters", "CV", "Obstruent", "Voice")

d = d[,columns]

head(d)
##   RTlexdec Familiarity   Word AgeSubject LengthInLetters CV Obstruent
## 1 6.543754        2.37    doe      young               3  C      obst
## 2 6.397596        4.43  whore      young               5  C      obst
## 3 6.304942        5.60 stress      young               6  C      obst
## 4 6.424221        3.87   pork      young               4  C      obst
## 5 6.450597        3.93   plug      young               4  C      obst
## 6 6.531970        3.27   prop      young               4  C      obst
##       Voice
## 1    voiced
## 2 voiceless
## 3 voiceless
## 4 voiceless
## 5 voiceless
## 6 voiceless

This looks good for plotting.

RTlexdec is a numeric vector of the log RT in visual lexical decision. This will be our variable of interest (response), so we’ll be plotting other variables as possible predictors of RTlexdec.

EXERCISE

Review from last time (also related to EDA).

In what follows, create a code that explores each question. You’ll need xtabs(), prop.table() and ddply().

1. What’s the proportion of voiced and voiceless segments (initial phoneme)?

prop.table(xtabs(~Voice, d))
## Voice
##    voiced voiceless 
## 0.4509632 0.5490368

2. Create a contingency table that includes Voice and CV. % by Voice (ignore the obvious correlation here)

prop.table(xtabs(~Voice + CV, d),1)
##            CV
## Voice               C         V
##   voiced    0.9407767 0.0592233
##   voiceless 1.0000000 0.0000000

3. Create a data frame that shows the mean RT and the mean Familiarity by AgeSubject and LengthInLetters. What impact do these two variables have on the mean reaction time? How about Familiarity and LengthInLetters?

ddply(d,.(AgeSubject, LengthInLetters), summarise, meanFam=mean(Familiarity), meanRT=mean(RTlexdec))
##    AgeSubject LengthInLetters  meanFam   meanRT
## 1         old               2 6.436667 6.609008
## 2         old               3 3.922692 6.650408
## 3         old               4 3.827178 6.659876
## 4         old               5 3.736024 6.658479
## 5         old               6 3.581834 6.696432
## 6         old               7 3.638333 6.717936
## 7       young               2 6.436667 6.299283
## 8       young               3 3.922692 6.423638
## 9       young               4 3.827178 6.440874
## 10      young               5 3.736024 6.439958
## 11      young               6 3.581834 6.458087
## 12      young               7 3.638333 6.465111

{ggplot2}

Documentation available at http://docs.ggplot2.org/

Syntax

There are two main ways to use ggplot. The quick method (less intuitive), and the longer method (easier to remember).

  • qplot() is the ‘quick version’

Read about it here: http://docs.ggplot2.org/0.9.3/qplot.html

Fundamentals

Scatter plot

This is an INCREMENTAL example: qplot(variable1, variable2, data)

Let’s plot RTlexdec as a function of Familiarity (and other vars)

qplot(Familiarity, RTlexdec, data=d)

qplot(Familiarity, RTlexdec, data=d, colour=AgeSubject)

qplot(Familiarity, RTlexdec, data=d, colour=AgeSubject, size=LengthInLetters)

qplot(Familiarity, RTlexdec, data=d, colour=AgeSubject, size=LengthInLetters, alpha=I(0.1))

qplot(Familiarity, RTlexdec, data=d, colour=AgeSubject, size=LengthInLetters, alpha=I(0.2), facets = Voice ~ Obstruent)

  • ggplot() is the longer version (this is what we’ll be using here)

Same incremental example: ggplot(data, aes(axes)) + ...

ggplot(data=d, aes(x=Familiarity, y=RTlexdec)) + geom_point()

ggplot(data=d, aes(x=Familiarity, y=RTlexdec, color=AgeSubject)) + geom_point()

ggplot(data=d, aes(x=Familiarity, y=RTlexdec, color=AgeSubject, size=LengthInLetters)) + geom_point()

ggplot(data=d, aes(x=Familiarity, y=RTlexdec, color=AgeSubject, size=LengthInLetters)) + geom_point(alpha=0.2)

ggplot(data=d, aes(x=Familiarity, y=RTlexdec, color=AgeSubject, size=LengthInLetters)) + geom_point(alpha=0.2) + facet_grid(Voice ~ Obstruent)

The titles in each facet will be inherited from the factor levels. So you can change them if you wish (but this will required changing the level names).

Alpha goes from 0 to 1.

You can see that creating a plot with ggplot() is like adding components/layers.

First, you tell R the basics (data and axes). Then, you choose the method. For example, scatter plots = geom_point()

Each TYPE of plot you wish to create is generated by a geom_...():

  • boxplot: + geom_boxplot()
  • histograms: + geom_histogram()
  • bar plots: + geom_bar()
  • scatter plots: + geom_point()
  • spread points: + geom_jitter()
  • lines: + geom_line()
  • violin plots: + geom_violin()
  • density plots: + geom_density()
  • text plots: + geom_text()

etc.

  • annotation: + annotate()

  • pie charts: + coord_polar()

Check http://docs.ggplot2.org/ for a full list.

In fact, you can add layers that don’t necessarily make sense, but this shows you the amount of control that you have:

ggplot(data=d, (aes(x=AgeSubject, y=RTlexdec))) + geom_jitter(color='brown', alpha=0.3) + geom_boxplot() + geom_violin(alpha=0.3, fill='blue')

This plot is just totally off, but it’s generated without warnings or errors. Layering is a great aspect of plotting in R, but R won’t tell you which plots make more/less sense.

Also, note that layers are built in the order you add them.

Density plots

ggplot(data=d, aes(x=RTlexdec)) + geom_density(fill="blue", color="blue", alpha=0.5) + geom_rug() 

# You can flip your plot by adding + coord_flip():

ggplot(data=d, aes(x=RTlexdec)) + geom_density(fill="blue", color="blue", alpha=0.5) + geom_rug() + coord_flip()

Whenever you want to know which arguments a geom_... takes:

  • type ?geom_...
  • Usually, google will be more useful, though.

This plots the density of RTlexdec and also adds a geom_rug(), so you can see where the data points are more concentrated.

Boxplots

Let’s say you have a categorical variable in the x axis and a continuous variable in the y axis.

ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot()

What if you want to add the data points on top of the boxplot?

ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot() + geom_jitter()

This doesn’t look that great, though. Add transparency (alpha):

ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot() + geom_jitter(alpha=0.1)

Adding facets to check for interactions

ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot() + geom_jitter(alpha=0.1) + facet_grid(~CV)

Barplots

Histogram

geom_bar() and geom_histogram() can give you the same output:

ggplot(data=d, aes(x=Obstruent, y=(..count..)/sum(..count..))) + geom_bar() + scale_y_continuous(labels = percent_format()) + labs(x="Type of segment", y="%")

ggplot(data=d, aes(x=Obstruent, y=(..count..)/sum(..count..))) + geom_histogram() + scale_y_continuous(labels = percent_format()) + labs(x="Type of segment", y="%")

Adding ERROR BARS (standard error): a manual way to do it:

You probably won’t add error bars like this, so this is just to show you how you could in principle do it.

First, we need to create a function that calculates std errors:

sem = function(x){
    n = sum(x,na.rm=T)/mean(x,na.rm=T)
    sqrt(var(x, na.rm=T)/n)
}

Don’t worry about syntax here; we’ll look into functions next time.

Now we’ll extract the mean RTlexdec by age group:

data = ddply(d,.(AgeSubject), summarise, meanRT = mean(RTlexdec))

Finally, we’ll add the std errors to our new data frame.

data$SE = NA

data[1,'SE'] = sem(d[d$AgeSubject == 'old',]$RTlexdec)
data[2,'SE'] = sem(d[d$AgeSubject == 'young',]$RTlexdec)

This is what our input looks like now:

data
##   AgeSubject   meanRT          SE
## 1        old 6.660958 0.002418593
## 2      young 6.439237 0.002224914
ggplot(data=data, aes(x=AgeSubject, y=meanRT)) + geom_errorbar(width = 0.2, aes(ymax = meanRT + SE, ymin = meanRT - SE)) + geom_bar(stat='identity', alpha=0.4)

You can see error bars are tiny here (note the y-axis scale)

You can also only use a point + error bars, which makes more sense:

ggplot(data=data, aes(x=AgeSubject, y=meanRT)) + geom_errorbar(width = 0.2, aes(ymax = meanRT + SE, ymin = meanRT - SE)) + geom_point()

Read more: http://docs.ggplot2.org/0.9.3.1/geom_errorbar.html

Whenever you have you a question on ggplot2, google it. A good place: http://stackoverflow.com/questions/tagged/ggplot2

BUT

YOU DON’T NEED TO DO IT MANUALLY:

ourPlot = ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) 

Let’s assign the basic bit of the plot to a variable to be concise

  • Plotting sd:
ourPlot + stat_summary(fun.y = sd, geom = "point", position="dodge", size=4, shape=5)
## ymax not defined: adjusting position using y instead

  • Plotting mean: (ylab will be adjusted)
ourPlot + stat_summary(fun.y = mean, geom = "point", position="dodge", size=4, shape=5)
## ymax not defined: adjusting position using y instead

To get a shape chart, visit: http://www.cookbook-r.com/Graphs/Shapes_and_line_types/

You can also simply use a letter (only ONE), but use "":

ourPlot + stat_summary(fun.y = mean, geom = "point", position="dodge", size=10, shape="A")
## ymax not defined: adjusting position using y instead

ERROR BAR:

ourPlot + stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width=0.1)

ourPlot + stat_summary(fun.data = mean_cl_normal, geom = "pointrange", width=0.1) 

You can add the mult argument to stat_summary, which controls the No. of sd from the mean you want to plot.

ourPlot + stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width=0.1, mult=5)

How would you create the same plot as a function of the CV variable?

ourPlot + stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width=0.1, mult=5) + facet_grid(~CV)

As you can see, stat_summary() is incredibly useful.

Text and annotation

Note: Many of the examples that follow wouldn’t normally be used.

Text instead of points

Let’s say you want to plot the actual words (i.e., values of a var). You also want to add a green (!) trend line.

ggplot(data=d[d$LengthInLetters==3,], aes(x=Familiarity, y=RTlexdec)) + geom_text(aes(label=Word), alpha=0.5) + geom_smooth(color="green", size=2)
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

geom_smooth() also includes the standard error (gray area). You can remove it, though.

ggplot(data=d[d$LengthInLetters==3,], aes(x=Familiarity, y=RTlexdec)) + geom_text(aes(label=Word), alpha=0.5) + geom_smooth(color="green", size=2, se=F)
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

You can also use other statistical methods to define the trend line. For example, let’s say you want to use a (linear) regression line:

ggplot(data=d[d$LengthInLetters==3,], aes(x=Familiarity, y=RTlexdec)) + geom_text(aes(label=Word), alpha=0.5) + geom_smooth(color="green", size=2, method='lm')

Annotation

What if you want to write on your plot?

ggplot(data=d[d$LengthInLetters==3,], aes(x=Familiarity, y=RTlexdec)) + geom_text(aes(label=Word), alpha=0.5) + geom_smooth(color="green", size=2) + annotate("text", x=6, y=7, label="Test", fontface="bold", size=9, color="red")
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

So you can provide x and y coordinates, fontsize, color etc.

Formatting

Visit http://docs.ggplot2.org/0.9.2.1/theme.html

Title and labels
ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot() + geom_jitter(alpha=0.1) + ggtitle("My plot") + ylab("log(RT)") + xlab("Subject age")

Colors

Visit http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/

The package {RColorBrewer} is quite impressive, so if you’re into colors, check it out.

ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot() + geom_jitter(alpha=0.1, color="blue") + ggtitle("My plot") + ylab("log(RT)") + xlab("Subject age")

ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot() + geom_jitter(alpha=0.2, aes(color=Voice)) + ggtitle("My plot") + ylab("log(RT)") + xlab("Subject age")

You can always change the colors, of course (you can even create your own)

Background

Everything about ggplot can be formatted. You might want to have a white background (better for publication):

ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot() + geom_jitter(alpha=0.1) + ggtitle("My plot") + ylab("log(RT)") + xlab("Subject age") + theme_bw()

You can play around with the arguments in geom_boxplot().

ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot(fill="lightblue") + geom_jitter(alpha=0.1) + ggtitle("My plot") + ylab("log(RT)") + xlab("Subject age") + theme_bw()

Colors are usually avoided, but sometimes they can help a lot.

IMPORTANT:

  1. geom_boxplot(aes(THESE ARGUMENTS)) refer to the data

  2. geom_boxplot(aes(), THESE ARGUMENTS) refer to independent parameters

For example, if you want the plot to be red, use (2). If you want colors to represent the age of the subjects, use (1).

Spacing

Again, there are several ways to change the distances in your plot. The easiest (simplest) way to add more dist. between labels and the plot is by adding \n to your strings.

ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot(width=0.4, size=1) + geom_jitter(alpha=0.1) + ggtitle("My plot\n") + ylab("log(RT)\n") + xlab("Subject age\n") + theme_bw()

You can also use \n to break lines in labels

ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot() + geom_jitter(alpha=0.1) + ggtitle("My title\nhas two lines\n") + ylab("log(RT)\n") + xlab("Subject age\n") + theme_bw()

If you google ‘spacing in ggplot’, you’ll find hundreds of pages.

If you need something more complete, check the theme() function:

http://docs.ggplot2.org/0.9.3/theme.html

Fonts

If you don’t like R fonts, you’ll need to install a package and then load it. Once you’ve done that, you can use different font faces INSIDE theme().

library(extrafont)
## Registering fonts with R

You don’t have to do that to change font size, though.

IMPORTANT: Remember that the font sizes interact with the plot window resolution when you save your plot.

ggplot(data=d, aes(x=AgeSubject, y=RTlexdec)) + geom_boxplot() + geom_jitter(alpha=0.1) + ggtitle("My title\nhas two lines\n") + ylab("log(RT)\n") + xlab("Subject age\n") + theme_bw() + theme(text=element_text(size=20, family="CMU Sans Serif"))